In this activity, we will analyze data from a malaria study in Namibia to examine how parasite genetic information can be used to make inferences about malaria transmission in a particular region. The exercise will demonstrate how various measures of within- and between-host parasite genetic diversity, population genetic differentiation, and genetic relatedness are associated with certain aspects of malaria transmission. Through these associations, we will demonstrate how certain analyses of parasite genetic data can provide information that is useful for malaria surveillance and for targeting interventions. Specifically, the activity will cover the computation and implementation of the following metrics:
The learning objectives of this activity are:
The data we present were published by Tessema et al. (2019) in the journal eLife. Briefly, a combination of dried blood spots and rapid diagnostic tests was collected from \(2585\) symptomatic malaria patients at \(29\) health facilities in four health facility districts in northeastern Namibia: Rundu, Nyangana, Andara, and Zambezi. These samples were genotyped using \(26\) microsatellite markers.
Definition: A microsatellite is a tract of DNA in which a short motif, typically one to six base pairs long, is repeated multiple (5 - 50) times.
Transmission of malaria in Namibia is generally low compared to many other areas of sub-Saharan Africa, with efforts to reduce transmission over the last decade being largely successful, although progress has slowed in recent years.
The map above shows estimated malaria prevalence from the Malaria Atlas Project in Namibia in 2018. Namibia has relatively low prevalence rates compared to neighboring countries, suggesting lower malaria transmission (though incidence and prevalence are not direct measures of transmission).
Malaria Atlas Project estimates of malaria incidence within the study area (shown above) suggest that incidence is lowest in Zambezi. However, the spatial resolution of these estimates is too coarse to reveal any differences in incidence between the remaining health facility districts (Rundu, Andara, and Nyangana).
We know from other estimates of incidence and prevalence collected at health centers in these four regions that Rundu and Andara districts had higher malaria incidence and higher proportions of imported malaria cases during this time period than Nyangana and Zambezi.
Therefore, for the remainder of this activity we will assume that Rundu and Andara have higher malaria transmission than Nyangana and Zambezi.
To begin, we will examine the parasite genetic data in its raw form in order to gain familiarity with the data and how it is displayed. The genetic data collected in the study can be represented by a matrix (or table). Each sample from an individual (\(n=2585\)) is a row in our table, and each locus or microsatellite in the genome that we genotyped is a column \((26)\), with one additional column for the sample’s unique identification number, or sample ID.
Therefore the table for our genetic data has dimensions \(2585 \times 27\). Each individual entry in the table of data is a list of which alleles were found in that particular sample for that particular locus.
Definition: An allele is one of the variant forms of the sequence found at a particular genetic locus, which is inherited from parent to offspring.
Definition: A locus is a fixed position on a chromosome where a particular genetic marker is located.
We read the raw data from an Excel spreadsheet (or a similar format) into the statistical computing software R so that we can manipulate the data more easily and perform statistical analysis. After reading the data into R, we can see what the raw data look like. Below, for convenience, we show \(6\) samples (rows) and the first \(9\) columns: the sample ID followed by the first \(8\) microsatellite loci (labeled “AS1”, “AS2”, “AS3”, “AS32”, “AS7”, “AS8”, “AS11”, and “AS12”). If there is more than one allele at a given locus, the alleles are separated by a semicolon (;).
| Sample | AS1 | AS2 | AS3 | AS32 | AS7 | AS8 | AS11 | AS12 |
|---|---|---|---|---|---|---|---|---|
| 1071878953 | 195 | 227 | 198 | 165 | ||||
| 1071879259 | 198;192 | 242 | 201;198 | 162;165 | ||||
| 1071879261 | 192 | 227 | 198 | 162;165 | ||||
| 1071879264 | 195;192 | 242 | 171 | 198 | 165 | |||
| 1071879290 | 189;195 | 227;242 | 201;198 | 162;165 | ||||
| 1071879292 | 179;173 | 198;195;192 | 227;242 | 198 | 162;156;159 | 162;165 |
Looking more closely at sample number \(1071879259\) in the second row we see that it has:
Note: Some samples have no allele at a particular locus (e.g. sample number \(1071879259\) has no alleles at locus AS1). This is because we were unable to obtain data for that particular locus, which can happen for a multitude of reasons and is a common occurrence in genetic data sets. For this reason we often need to perform some degree of quality control or data cleaning. For example, we may elect to remove samples (rows) that have a lot of missing data and/or loci (columns) that have missing data for a large number of the samples.
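The quality-control step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the activity. The function names, the toy sample IDs, the locus names, and the 50% missingness cutoff are all assumptions made for the example, not values from the study.

```python
def parse_alleles(cell):
    """Split a raw cell such as '198;192' into a list of allele strings."""
    if cell is None:
        return []
    return [a.strip() for a in cell.split(";") if a.strip()]


def missing_fraction(row):
    """Fraction of loci in one sample with no allele call.

    `row` maps locus name -> raw cell text (empty string when missing).
    """
    n_missing = sum(1 for cell in row.values() if not parse_alleles(cell))
    return n_missing / len(row)


def drop_sparse_samples(table, max_missing=0.5):
    """Keep samples whose fraction of missing loci is at most `max_missing`."""
    return {sid: row for sid, row in table.items()
            if missing_fraction(row) <= max_missing}


# Hypothetical toy data: sample IDs, locus names and allele calls are made up.
toy = {
    "sample_A": {"locus1": "198;192", "locus2": "242", "locus3": "", "locus4": "165"},
    "sample_B": {"locus1": "", "locus2": "", "locus3": "", "locus4": "201"},
}
kept = drop_sparse_samples(toy, max_missing=0.5)
print(sorted(kept))  # only sample_A passes the filter
```

The same idea can be applied to columns by computing, for each locus, the fraction of samples with an empty cell.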
Definition: Sequencing data consists of sequences of DNA obtained from a DNA sequencing reaction.
In addition to the genetic data, we also have epidemiological information, or metadata, about each individual. This includes the district where the sample was taken, the identifier for that sample (which is equivalent to the “Sample” column in the genetic data above), and the incidence of malaria for the area the sample was taken from.
We can link the metadata to the genetic data through the sample identifier associated with each sample (which is found in the column “sample_id” in the table).
## district sample_id incidence
## 1 Andara District 1071878953 1.612109
## 2 Andara District 1071879259 1.550435
## 3 Rundu East 1071879261 1.398383
## 4 Rundu East 1071879264 1.476010
## 5 Rundu East 1071879290 1.398383
## 6 Andara District 1071879292 1.550435
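As a rough sketch of how this link works, the snippet below joins metadata records to genetic records on the sample identifier. It is written in Python for illustration (the activity itself uses R), and `link_metadata` is a hypothetical helper name. The two metadata rows are copied from the table above. The genetic rows carry only the identifier here, because the example is about the join and not the allele calls.

```python
def link_metadata(genetic_rows, metadata_rows):
    """Attach district and incidence to each genetic record by sample ID."""
    # Index the metadata by sample ID (as a string) for fast lookup.
    meta_by_id = {str(m["sample_id"]): m for m in metadata_rows}
    linked = []
    for g in genetic_rows:
        meta = meta_by_id.get(str(g["Sample"]))
        if meta is None:
            continue  # skip genetic records that have no metadata
        linked.append({**g,
                       "district": meta["district"],
                       "incidence": meta["incidence"]})
    return linked


metadata = [
    {"district": "Andara District", "sample_id": "1071878953", "incidence": 1.612109},
    {"district": "Andara District", "sample_id": "1071879259", "incidence": 1.550435},
]
# Genetic records reduced to their identifier for this illustration.
genetic = [{"Sample": "1071878953"}, {"Sample": "1071879259"}]

linked = link_metadata(genetic, metadata)
for record in linked:
    print(record["Sample"], record["district"], record["incidence"])
```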
In this section we will examine how parasite genetic diversity within and between hosts is associated with malaria transmission intensity across the four sampled regions of Namibia (Rundu, Nyangana, Andara, and Zambezi). In order to do this we will compute the following:
For each sample, using the genotyping data from \(26\) microsatellite loci (shown above), multiplicity of infection (MOI), heterozygosity (\(H_E\)), and the within-host fixation index (\(F_{ws}\)) were calculated. In this section, we will show how to calculate each of these metrics (with examples), explore how these metrics vary across the four regions of Namibia, and describe what each of these metrics can reveal about malaria transmission.
Definition: MOI is a measure of the number of different malaria parasite clones in an individual sample.
Ways of calculating MOI:
MOI \(=\) maximum number of alleles present at any genotyped locus in a sample.
MOI \(=\) the second highest number of alleles present at any genotyped locus in a sample. This is done to address the possibility of false positives in detecting alleles.
MOI can also be statistically estimated using tools that account for several sources of error: parasites sharing the same allele and alleles that are present but not detected (false negatives), both of which result in underestimates, and false positive alleles, which result in overestimates. Software exists for calculating this from binary SNP data (THE REAL McCOIL, Chang et al, 2017) and from multi-allelic loci such as microsatellites and microhaplotypes (MOIRE, https://m-murphy.github.io/moire/ - currently unpublished).
Definition: False positive detection of alleles occurs when genetic data within two samples at a particular locus (in this case a microsatellite) appear to be different and are assigned as different alleles, but the difference is not real and is due to an error.
Definition: False negative detection of alleles occurs when genetic data within two samples at a particular locus (in this case a microsatellite) appear to be the same and are assigned as the same allele, but in reality they are different.
Example Calculation of MOI:
Suppose we look at the genetic data for one sample (number \(1071878953\)):
| Sample | AS1 | AS2 | AS3 | AS32 | AS7 | AS8 | AS11 | AS12 | AS14 | AS15 | AS19 | AS21 | AS34 | AS25 | B7M19 | TA109 | AS31 | Ara2 | PfPK2 | TA1 | TA87 | TA81 | TA60 | PolyA | PFG377 | TA40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1071878953 | 195 | 227 | 198 | 165 | 212 | 119;122 | 123;120;104;101 | 152;177 | 191 | 181 | 113 | 131 | 209 | 103 | 268;271 |
The maximum number of alleles present at a locus in this sample is \(4\) (AS25, containing alleles 123, 120, 104, and 101). However, in order to account for possible false positives, we will designate the MOI of this sample to be the next highest number of alleles present at a given locus, which is \(2\) (for example TA109, which contains alleles 152 and 177, and TA40, which contains alleles 268 and 271).
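The two counting rules can be sketched as follows, in Python for illustration. The allele calls are the non-missing cells of sample \(1071878953\) as listed above. Locus names are left out because only the number of alleles per locus matters. The function names are hypothetical, and taking the second entry of the sorted per-locus counts is one assumed reading of "second highest" (under this reading, two loci that tie for the maximum would leave the MOI at that maximum).

```python
def allele_counts(cells):
    """Number of alleles at each genotyped locus (missing loci are skipped)."""
    counts = []
    for cell in cells:
        alleles = [a for a in cell.split(";") if a.strip()]
        if alleles:
            counts.append(len(alleles))
    return counts


def moi_max(cells):
    """MOI as the maximum number of alleles at any genotyped locus."""
    return max(allele_counts(cells))


def moi_second_highest(cells):
    """MOI as the second entry of the per-locus counts sorted high to low."""
    counts = sorted(allele_counts(cells), reverse=True)
    if not counts:
        raise ValueError("no genotyped loci in this sample")
    return counts[1] if len(counts) > 1 else counts[0]


# Non-missing allele calls for sample 1071878953, in the order shown above.
sample_cells = ["195", "227", "198", "165", "212", "119;122",
                "123;120;104;101", "152;177", "191", "181",
                "113", "131", "209", "103", "268;271"]

print(moi_max(sample_cells))             # 4
print(moi_second_highest(sample_cells))  # 2
```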
We can repeat this procedure across all samples in our data to obtain the following distribution of MOI across the four sampled regions of Namibia:
Question: How many samples have monoclonal infections (monoclonal means MOI\(=1\))?
Answer: Approximately 600 (use the count on the histogram).
Question: What proportion of samples have \(MOI=3\)?
Answer: Approximately \(20\%\). Calculated by taking the count of samples that have \(MOI=3\) which is approximately \(500\), and dividing by the total number of samples in our data \(2585\). Since \(\frac{500}{2585}\times 100 = 19.34\%\), this is approximately \(20\%\).
A boxplot is another way we can visualize the MOI values in this data.
Important features of the boxplot:
Now that we understand the overall distribution of MOI in the sampled regions of Namibia we can consider the relationship between MOI and transmission intensity.
Question: What is the relationship between MOI and transmission intensity?
Answer: Generally, high transmission regions have higher average MOI among samples and low transmission regions have lower average MOI among samples. However, in very low transmission settings where many detected cases are imported, the average MOI may be more of a reflection of diversity in the areas where the imported cases originated than of diversity among locally acquired cases.
Question: Why do you think there is this relationship?
Answer: In high transmission settings, people often get many infectious mosquito bites fairly regularly, in other words they get re-infected with parasites in rapid succession. This, combined with the fact that many people harboring parasites in high transmission settings are asymptomatic (without symptoms), making them unlikely to get treatment for malaria and thus get rid of these parasites, means that people tend to accumulate high numbers of parasites in high transmission settings. There also tends to be a high genetic diversity of parasites in regions with high transmission (which will be discussed further later in this activity), therefore the chance that the parasites a person might accumulate are genetically distinct clones is higher. This means that a person in a high transmission region is likely to have many genetically distinct parasites at one time, giving them a high MOI.
We can now compare MOIs across the four sampled regions in Namibia and relate MOI to transmission intensity.
Question: Using information from the lecture, as well as information about known transmission intensities in the four regions of Namibia that samples were taken from (Rundu, Andara, Nyangana, and Zambezi), assign each boxplot of MOI to a region:
Answer: (note that the two higher (Rundu and Andara) and two lower transmission regions (Nyangana and Zambezi) are indistinguishable so what really matters is that the high and low transmission regions are correctly assigned)
Andara and Rundu, the higher transmission regions, tend to have higher MOI samples than Nyangana and Zambezi, the lower transmission regions. Generally, MOI values of samples and levels of transmission are positively correlated. In some cases however, regions with very low levels of endemic transmission but high levels of importation can have samples with high MOI.
Now we can calculate a population-level measure of genetic diversity, heterozygosity.
Definition: Population-level heterozygosity, or expected heterozygosity (\(H_E\)), is a measure of the genetic diversity of the different clones in a population at a particular genetic locus.
We can calculate the expected heterozygosity (population-level heterozygosity) of a particular locus with the following equation:
\[H_E= \left(\frac{n}{n-1}\right) \times \left(1-\sum_{i}p_i^2\right)\] where \(n\) is the number of genotyped samples in our population of interest, and \(p_i\) is the frequency of the \(i^{th}\) allele in the population at a particular locus. The value of \(H_E\) is between 0 and 1, with values close to 0 representing low genetic diversity and values close to 1 representing high genetic diversity.
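As a minimal sketch, the formula can be written out for a single locus. The example is in Python for illustration and makes a simplifying assumption that is not part of the activity: each sample contributes exactly one allele (as if all infections were monoclonal), so \(n\) is the length of the allele list. The allele values are toy inputs, and `expected_heterozygosity` is a hypothetical name.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """H_E = (n / (n - 1)) * (1 - sum_i p_i^2) for one locus.

    `alleles` holds one allele call per sample (a simplifying assumption).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = Counter(alleles)
    sum_p_squared = sum((c / n) ** 2 for c in counts.values())
    return (n / (n - 1)) * (1 - sum_p_squared)


# Toy inputs: every sample identical, a mix, and every sample different.
print(expected_heterozygosity(["195", "195", "195", "195"]))  # no diversity
print(expected_heterozygosity(["195", "195", "192", "198"]))  # intermediate
print(expected_heterozygosity(["195", "192", "198", "201"]))  # maximal diversity
```

For the middle example the allele frequencies are \(0.5\), \(0.25\), and \(0.25\), so \(\sum_i p_i^2 = 0.375\) and \(H_E = \frac{4}{3} \times 0.625 \approx 0.83\).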
Question: What relationship would we expect between population-level heterozygosity and transmission intensity?
Answer: We would expect higher population level heterozygosity in high transmission settings.
We can use this equation to calculate population level heterozygosity (\(H_E\)) across all four regions of Namibia we sampled from at each locus we genotyped (there are \(26\) points on this boxplot, one for each microsatellite locus):
Question: What is the median \(H_E\) or population level heterozygosity for our sampled loci (\(26\) microsatellite markers)?
Answer: Approximately 0.81 (level of the second highest horizontal line).
Question: What is the first quartile in this distribution of \(H_E\)?
Answer: Approximately \(0.63\) (level of the lowest horizontal line).
As we did with MOI we can compare \(H_E\) values for each locus across the four sampled regions in Namibia and relate \(H_E\) to transmission intensity.
Question: Using information from the lecture, as well as information about known transmission intensities in the four regions of Namibia that samples were taken from (Rundu, Andara, Nyangana, and Zambezi), assign each boxplot of \(H_E\) to a region:
Answer:
Andara, Rundu, and Nyangana have very similar heterozygosity values for the sampled loci, and Zambezi, the lower transmission region, has lower heterozygosity values across sampled loci. Heterozygosity and transmission intensity tend to be positively correlated.
Another measure of genetic diversity, which relates the genetic diversity of a single host to that of a population, is the within-host fixation index \((F_{ws})\).
Definition: \(F_{ws}\) is a measure of the within-host diversity of an individual infection relative to the population level genetic diversity.
A high \(F_{ws}\) indicates low within-host diversity relative to the population. A low \(F_{ws}\) indicates a high within-host diversity relative to the population.
\(F_{ws}\) is calculated using the following equation: \[F_{ws} = 1-\frac{H_w}{H_s}\] where \(H_w\) is the heterozygosity of the individual and \(H_s\) is the heterozygosity of the local parasite population.
It is common to report the related quantity \(1-F_{ws}\) instead, so that there is a positive relationship between the value and within-host diversity: \[1-F_{ws}= 1-\left(1-\frac{H_w}{H_s}\right) = \frac{H_w}{H_s}\]
\(1-F_{ws}\) has the opposite relationship to within host diversity relative to the population’s diversity. High values of \(1-F_{ws}\) suggest high within host diversity relative to the population’s diversity. Low values of \(1-F_{ws}\) suggest low within host diversity relative to the population’s diversity.
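A small numerical sketch can make the two scales concrete. It is written in Python for illustration. The heterozygosity values passed in are made-up inputs, and how \(H_w\) is obtained from a sample is not covered by this example.

```python
def fws(h_w, h_s):
    """Within-host fixation index: F_ws = 1 - H_w / H_s."""
    if h_s <= 0:
        raise ValueError("population heterozygosity H_s must be positive")
    return 1 - h_w / h_s


def one_minus_fws(h_w, h_s):
    """Complement of F_ws, which simplifies to H_w / H_s."""
    return 1 - fws(h_w, h_s)


# Made-up inputs: H_s is fixed at 0.8 and H_w varies.
for h_w in (0.0, 0.2, 0.8):
    print(h_w, round(fws(h_w, 0.8), 3), round(one_minus_fws(h_w, 0.8), 3))
```

An infection with no within-host diversity (\(H_w = 0\)) gives \(F_{ws} = 1\) and \(1-F_{ws} = 0\), while an infection as diverse as the whole population (\(H_w = H_s\)) gives \(F_{ws} = 0\) and \(1-F_{ws} = 1\).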
In higher transmission areas, individuals are more likely to be infected by multiple clones, so the within-host diversity of samples from these areas is likely to be higher and \(1-F_{ws}\) is likely to be higher. In contrast, individuals in low transmission regions are more likely to have monoclonal infections, or polyclonal infections with lower MOI, so within-host diversity is likely to be lower. This metric is closely related to MOI.
Question: Using information from the lecture, as well as information about known transmission intensities in the four regions of Namibia that samples were taken from (Rundu, Andara, Nyangana, and Zambezi), assign each boxplot of \(1-F_{ws}\) to a region:
Answer:
Andara and Rundu, the higher transmission regions, tend to have samples with higher \(1-F_{ws}\) values (lower \(F_{ws}\)) than Nyangana and Zambezi, the lower transmission regions. Recall that high values of \(F_{ws}\) are associated with lower transmission, whereas high values of \(1-F_{ws}\) are associated with higher transmission.
A slightly more indirect way to measure the genetic diversity of a population is to measure how closely related (genetically) pairs of infections are within a population. This measure is also associated with transmission intensity.
We can measure the relatedness of two parasites by considering how similar their sequences are. Two sequences that are identical at a given position or marker are said to be identical by state at this location. To summarise relatedness between two samples across the whole genome, we can use an identity by state (IBS) metric.
Definition: IBS is the proportion of shared alleles in two samples across all genotyped markers. High values of IBS suggest that parasites from two different samples are highly related.
For monoclonal samples, this can be expressed in equation form: \[IBS= \frac{\mbox{Number of shared alleles in two samples across genotyped markers}}{\mbox{Number of genotyped markers}}\]
For polyclonal samples, IBS at a particular locus \(i\) can be defined, for a pair of samples \(X\) and \(Y\), as the total number of shared alleles across the two samples, \(S_{i}\), divided by the product of the number of alleles at that locus in sample \(X\), \(X_{i}\), and the number of alleles at that locus in sample \(Y\), \(Y_{i}\). These per-locus values are averaged across all \(n\) loci to produce an overall estimate of IBS.
\[IBS= \frac{1}{n} \sum_{i=1}^{n} \frac{S_{i}} {X_{i}Y_{i}} \]
Question: Use this equation to calculate IBS at a single locus with the following microsatellite data:
Sample X: Allele 1 = 198, Allele 2 = 192
Sample Y: Allele 1 = 195, Allele 2 = 192
Answer: Both samples have a copy of allele 192, so the number of shared alleles, \(S_{i}\), is 2. Both samples \(X\) and \(Y\) have two alleles at this locus, so \(X_{i}\) and \(Y_{i}\) are both \(2\), and multiplied together \(X_{i}Y_{i}\) equals 4. As a result, IBS at this locus is \(\frac{2}{4}\) or \(0.5\).
The proportion of shared alleles is not a perfect measure of relatedness, as two samples may have the same alleles by chance even if they are not related (i.e. they do not have a common ancestor). Therefore, for pairs of samples with low proportions of shared alleles, this metric might not tell us much about their relative relatedness. For example, a pair of samples with \(IBS=0.01\) might not be more related than a pair with \(IBS=0.005\). IBS is also affected by errors in the sequencing process and by the MOI of each sample, and it does not take allele frequencies into account: sharing rare alleles is stronger evidence that two infections are related than sharing common alleles. Estimates of identity by descent (see lecture materials and supplement), while more complicated to calculate, overcome some of these issues.
To try to reduce this uncertainty, we can classify pairs of samples as “highly related” or “not highly related” according to whether the value of \(IBS\) is above a certain threshold. This allows us to ignore small differences and, hopefully, see more meaningful ones. Here we define a pair of samples as highly related if \(IBS>0.6\), although this threshold can be varied.
To measure relatedness between all samples in a population, we can look at the proportion of pairs of samples that are highly related. In a population that is more related we would expect to see a higher proportion of highly related pairs of samples.
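The monoclonal form of IBS and the thresholding step can be sketched as follows, in Python for illustration. The three samples, their locus names, and their allele calls are invented toy data. Restricting the denominator to loci genotyped in both samples is an assumption of this sketch, and the \(0.6\) cutoff is the one defined above.

```python
from itertools import combinations


def ibs_monoclonal(x, y):
    """Proportion of markers with the same allele in two monoclonal samples.

    `x` and `y` map locus name -> allele; only loci typed in both are used.
    """
    loci = [locus for locus in x if x[locus] and y.get(locus)]
    if not loci:
        raise ValueError("no loci genotyped in both samples")
    matches = sum(1 for locus in loci if x[locus] == y[locus])
    return matches / len(loci)


def proportion_highly_related(samples, threshold=0.6):
    """Fraction of all sample pairs with IBS above `threshold`."""
    pairs = list(combinations(sorted(samples), 2))
    n_high = sum(1 for a, b in pairs
                 if ibs_monoclonal(samples[a], samples[b]) > threshold)
    return n_high / len(pairs)


# Invented toy data: three monoclonal samples typed at five loci.
toy_samples = {
    "s1": {"m1": "195", "m2": "227", "m3": "198", "m4": "165", "m5": "212"},
    "s2": {"m1": "195", "m2": "227", "m3": "198", "m4": "165", "m5": "215"},
    "s3": {"m1": "195", "m2": "242", "m3": "201", "m4": "162", "m5": "209"},
}

print(ibs_monoclonal(toy_samples["s1"], toy_samples["s2"]))  # 4 of 5 markers match
print(proportion_highly_related(toy_samples))                # 1 of 3 pairs is above 0.6
```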
Question: Why are so few pairs of samples highly related?
Answer: Due to recombination and mutation, only samples that are very close together in the transmission chain of who-infected-whom will be highly related. Because not all infected individuals are sampled, unless transmission is very low, most samples will be unrelated.
Definition: Recombination is the process by which homologous chromosomes exchange information (alleles) during meiosis, after gametes mate in the mosquito.
Definition: Mutation is a change in the genetic sequence of an organism, such as a SNP, insertion, deletion, or duplication.
Question: Using information from the lecture, as well as information about known transmission intensities in the four regions of Namibia that samples were taken from (Rundu, Andara, Nyangana, and Zambezi), assign each bar to a district.
Answer:
Zambezi and Nyangana, regions with relatively low transmission, have the highest proportion of highly related pairs of infections. Rundu and Andara, regions with relatively high transmission, have a low proportion of highly related pairs. With the exception of very low transmission settings that are nearing elimination, low transmission settings tend to have a greater proportion of pairs of samples that are highly related. On the other hand, higher transmission settings tend to have a lower proportion of highly related pairs of samples.
We can also look at the overall distribution of relatedness between pairs by district (the dashed line shows the cutoff for pairs to be classified as highly related):
Question: What do you notice about the distributions?
Answer: This is an open-ended question, and there may be other valid and interesting observations not listed here. Some features to note are that very few pairs of samples are highly related, and that some regions have more normal distributions whilst others are skewed towards lower relatedness values.
As well as looking at measures of genetic diversity of parasites to better understand malaria transmission intensity, we can use genetic data to understand connectivity across regions by looking at genetic differentiation (or relatedness) between the different subpopulations in different regions.
Definition: Connectivity between two locations describes the flow of malaria parasites from one region to another.
We can also use genetic relatedness measured by IBS between districts to estimate how connected two districts are. When there is more connection between two regions - i.e. people travel between the two regions, bringing parasites with them which can be transmitted to the local population if they are bitten by a competent mosquito vector - then districts will tend to share more highly related pairs. This can help determine whether transmission in one area may affect transmission in another, guiding the coordination of interventions in a region. We can compare pairwise relatedness between the four districts in Namibia:
The regions with the largest proportion of highly related pairs among them are the three geographically closest regions: Rundu, Nyangana, and Andara. Comparisons between Zambezi and Rundu, Zambezi and Andara, and Zambezi and Nyangana have the lowest proportion of highly related pairs, consistent with geographic distance. The least connected regions tend to have lower proportions of highly related pairs by IBS.
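A between-district comparison of this kind can be sketched by grouping sample pairs by the districts they come from. The example is in Python for illustration. The IBS values are invented, each tuple stands for one pair of samples taken from two different districts, and `highly_related_by_district_pair` is a hypothetical helper name.

```python
from collections import defaultdict


def highly_related_by_district_pair(pairs, threshold=0.6):
    """Proportion of highly related sample pairs for each pair of districts.

    `pairs` is an iterable of (district_1, district_2, ibs) tuples.
    """
    totals = defaultdict(int)
    related = defaultdict(int)
    for d1, d2, ibs in pairs:
        key = tuple(sorted((d1, d2)))  # treat (A, B) and (B, A) as the same
        totals[key] += 1
        if ibs > threshold:
            related[key] += 1
    return {key: related[key] / totals[key] for key in totals}


# Invented IBS values for pairs of samples from different districts.
toy_pairs = [
    ("Rundu", "Andara", 0.7),
    ("Rundu", "Andara", 0.2),
    ("Andara", "Rundu", 0.1),
    ("Rundu", "Zambezi", 0.1),
    ("Zambezi", "Rundu", 0.3),
]

for districts, proportion in sorted(highly_related_by_district_pair(toy_pairs).items()):
    print(districts, round(proportion, 3))
```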
Definition: Fixation index, or \(F_{st}\) and Jost’s D are both measures of genetic differentiation between different subpopulations.
When \(F_{st}\) is close to 1 it means that individuals in a population tend to carry a small number of alleles. Individuals are “differentiated” in the sense that alleles are not freely mixing between people.
When Jost’s D is close to 1 it means that alleles tend to be present in a small number of individuals. Individuals are “differentiated” in the sense that they contain different genetic material.
High values of \(F_{st}\) and Jost’s D both suggest greater genetic differentiation between the subpopulations being measured. However, the two metrics measure genetic differentiation in different ways.
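The activity does not give formulas for these two metrics, so the following is only a sketch based on commonly used textbook definitions, which are an assumption here and not necessarily the estimators used to produce the results in this activity. Let \(H_S\) be the mean heterozygosity within subpopulations, \(H_T\) the heterozygosity of the pooled allele frequencies, and \(k\) the number of subpopulations. Then Nei's \(G_{ST} = (H_T - H_S)/H_T\) is a multi-allelic analogue of \(F_{st}\), and Jost's \(D = \frac{H_T - H_S}{1 - H_S} \times \frac{k}{k-1}\). The Python code below applies these to one locus with made-up allele frequencies and no sample-size corrections.

```python
def heterozygosity(freqs):
    """1 - sum of squared allele frequencies (no sample-size correction)."""
    return 1 - sum(p * p for p in freqs.values())


def differentiation(pop_freqs):
    """Return (G_ST, Jost's D) for one locus.

    `pop_freqs` is a list of dicts, one per subpopulation, mapping
    allele -> frequency. Subpopulations are weighted equally.
    """
    k = len(pop_freqs)
    h_s = sum(heterozygosity(f) for f in pop_freqs) / k
    alleles = set().union(*pop_freqs)
    pooled = {a: sum(f.get(a, 0.0) for f in pop_freqs) / k for a in alleles}
    h_t = heterozygosity(pooled)
    g_st = (h_t - h_s) / h_t if h_t > 0 else 0.0
    jost_d = ((h_t - h_s) / (1 - h_s)) * (k / (k - 1)) if h_s < 1 else 0.0
    return g_st, jost_d


# Made-up frequencies: identical subpopulations, then no shared alleles.
same = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]
disjoint = [{"A": 0.5, "B": 0.5}, {"C": 0.5, "D": 0.5}]

print(differentiation(same))      # both metrics are 0
print(differentiation(disjoint))  # G_ST is about 0.33, Jost's D is 1
```

In the second toy case the subpopulations share no alleles, yet \(G_{ST}\) stays well below 1 because diversity within each subpopulation is high, while Jost's D reaches 1. This is one way in which the two metrics can differ for highly variable markers.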
Question: Using information from the lecture, as well as information about known transmission intensities in the four regions of Namibia that samples were taken from (Rundu, Andara, Nyangana, and Zambezi), which pairs of districts do you think will have the greatest values of Jost’s D and \(F_{st}\)? In other words, which districts will have the greatest genetic differentiation between them?
Answer: We can plot pairwise comparisons between the four districts.
The pairwise comparisons with Zambezi (Rundu-Zambezi, Andara-Zambezi, Nyangana-Zambezi) have the highest values of \(F_{st}\) and Jost’s D, which is consistent with Zambezi being a low transmission region, as well as the geographical distance between Zambezi sampling sites and the other three regions.
Both high Jost’s D and high \(F_{st}\) are consistent with populations being more differentiated from each other, either through less mixing (\(F_{st}\)) or different genetic material (Jost’s D).
Question: Compare and contrast the results obtained with \(F_{st}\) and Jost’s D with the inter-district relatedness measured by IBD and IBS. How are they different? How are they the same? Why might there be differences?
Answer: The highest values of \(F_{st}\) and Jost’s D were in pairwise comparisons between Zambezi and the other regions, suggesting the highest levels of differentiation both in terms of the populations containing different genetic material and in alleles not flowing as freely between those populations. This is also reflected in Zambezi sharing a lower proportion of highly related pairs by IBS with other regions.